import sys
sys.path.append('../audio_processing/')
import audio_processing as ap
import librosa
import IPython.display as ipd
import numpy as np
import musdb
import matplotlib.pyplot as plt
import residual as res
By Coby Wilcox (4812764) PSTAT 174
In this report, I'll examine the errors in my generated sources and try to answer questions like:
Where is the error mainly coming from? What are the different types of error in audio processing? How do these different types of error sound, appear visually, and behave mathematically? Are there ways to address the error created?
To start, let's look at the masks I created in the spectrogram limits report.
stems, track, sr = res.load_waltz()
This seems like the natural place to start. If we represent these audio files visually in their spectrogram/waveform, do we see much difference between the two? Can we zoom in and find more differences, or are the differences apparent enough that this isn't necessary?
Here we can see the residuals mapped onto one visual, so you can see exactly what belongs to the target source, what belongs to the estimated source, and what they share.
spec_data, masks = res.make_masks(stems)
Interpreting these masks can be a bit challenging, as it's hard to make out exactly what is going on in these spectrograms. I think the best one to look at to understand them is the bass spectrogram. In the target, we can see that all the sound comes from the low frequencies, which is to be expected from the bass.
Then it's best to view the mask spectrograms as weight matrices with only neutral or negative weights. Remember, these are always applied to the mix spectrogram shown in the top left corner. Blue means the values aren't affected, and red means the values are negatively weighted. So the bass mask will mostly remove audio signals that correspond to high frequencies while preserving most of the low frequencies.
To understand the masks a little more, it can help to remind yourself how they are being calculated. The masks are built from the target spectrograms themselves, which is typically impossible in practice, but is fine for the sake of demonstration. They are calculated using a simple nonlinear formula where the weight equals the fraction the target makes up of the overall mix signal.
$$m_i = \frac{s_i}{max(s, s_i)}, s=\text{Mix STFT Representation}, s_i=\text{Target STFT Representation}$$So if the target signal is more prevalent at a frequency than the mix, the mask does not affect that frequency. If the mix is more prevalent, then the mix is simply multiplied by the fraction the target makes up of that frequency. You may also be confused by the appearance of negative decibels in the colorbar. Generally, negative decibels are sounds with magnitudes below the human ability to perceive them, but in this case they can actually very eloquently explain what the mask does. Human beings hear on a $log_{10}$ scale, which is what decibels represent; i.e., if a sound has ten times more magnitude, we hear it as roughly twice as loud. And a nice property of logs you may remember is:
$$log(a/b) = log(a)-log(b)$$Which is exactly what we are doing in these representations: we divide the target source by the overall mix, so the masks essentially subtract the right number of decibels from the mix itself.
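To make the mask formula concrete, here is a minimal numpy sketch; the arrays are hypothetical stand-ins for the STFT magnitudes (the report's actual masks come from `res.make_masks`):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical magnitude spectrograms (freq bins x frames); in the report
# these come from STFTs of the mix and of one target stem.
S_mix = rng.random((5, 4)) + 0.1      # |mix STFT|
S_target = 0.5 * S_mix                # pretend the target is half the mix everywhere

# Soft mask: m_i = s_i / max(s, s_i), elementwise.
mask = S_target / np.maximum(S_mix, S_target)

# Applying the mask can only attenuate the mix, never boost it.
S_est = mask * S_mix

# In decibels, the mask subtracts 20*log10(mix/estimate) wherever the mix dominates.
db_drop = 20 * np.log10(S_mix / S_est)
```

Because the toy target is exactly half the mix, the mask is 0.5 everywhere and the decibel drop is a constant 20·log10(2) ≈ 6 dB.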
s_hat = res.apply_masks(spec_data, masks)
Looks pretty good! Almost identical; however, if we graph the residuals, hopefully the differences will become a little more apparent.
residuals = res.plot_residuals_spec(spec_data, s_hat)
Looking at this the differences are much more apparent. There are a couple of things to note.
est_signals, true_signals = res.playback_sources(spec_data, s_hat)
True Signal: Drums
Estimated Signal: Drums
True Signal: Bass
Estimated Signal: Bass
True Signal: Other
Estimated Signal: Other
True Signal: Vocals
Estimated Signal: Vocals
Listening to the generated sources, they are nearly identical to the true sources. But there are little bits of distortion and artifacts within the new signals. They can be hard to hear; the most easily perceptible is in the vocals, as our ears are more sensitive to changes in people's voices than in instrument sounds. Most likely these are the little bits of noise that we found in the residual analysis.
So, where are these pieces of noise coming from exactly? These estimations should have little to no error, as they are generated using the actual source signals themselves. I'll try to answer this question in the next sections.
res.plot_audio_signals(est_signals, true_signals)
res.display_csd(est_signals, true_signals)
res_signals = res.plot_residual_signals(est_signals, true_signals)
sources = ['Drums','Bass','Other','Vocals']
for i, resid in enumerate(res_signals):
    print('Residual:', sources[i])
    ipd.display(ipd.Audio(resid, rate=sr))
Residual: Drums
Residual: Bass
Residual: Other
Residual: Vocals
Looking at these residuals, what you realize is that they are just a tiny sliver of the actual target audio that didn't make it in. This makes sense when you remember how the mask worked for the estimated sources: all it does is subtract loudness from specific frequencies in the audio. So on frequencies where multiple sources overlap, some loudness will be subtracted, and some of it may belong to the target source by accident. There is another lens through which we can understand these issues as well: the phase of the mixture.
When reviewing the separation method itself, I recalled how, in especially contested frequencies, the model simply applies a proportional decrease in volume, which may be the main source of the residuals. To check this hypothesis, I wanted to see how much the frequencies in one source's residuals correlate with other sources. To examine this, I first graphed the periodogram of each source's residuals to see which frequencies were most present. A periodogram is an estimate of a signal's spectral density, built from the Fourier transform of the signal, which approximates the contribution of each frequency component. Mathematically we define a periodogram as:
$$|d(\omega_j)|^2 = |\frac{1}{\sqrt{n}}\sum_{t=1}^{n}{x_t \cdot e^{-2\pi i\omega_j t}}|^2$$Where $x_t$ are the $n$ samples of the audio signal, and $\omega_j$ are the Fourier frequencies. Or
$$I(\omega_j) = \sum_{h=-(n-1)}^{n-1}{\hat{\gamma}(h) \cdot e^{-2\pi i\omega_j h}}$$Where $\hat{\gamma}(h)$ is the sample autocovariance, and $\omega_j$ are the Fourier frequencies.
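As a sanity check on the first definition, a periodogram can be computed directly from the FFT. This toy example uses a hypothetical 100 Hz sine rather than the report's residuals, and the dominant frequency pops right out:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sr = 1024, 1024                       # n samples at a hypothetical sample rate
t = np.arange(n) / sr
x = np.sin(2 * np.pi * 100 * t) + 0.1 * rng.standard_normal(n)

# Periodogram: |d(omega_j)|^2 where d is the DFT scaled by 1/sqrt(n).
d = np.fft.rfft(x) / np.sqrt(n)
I = np.abs(d) ** 2
freqs = np.fft.rfftfreq(n, d=1 / sr)

# The 100 Hz component dominates the spectrum.
peak = freqs[np.argmax(I)]
```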
Due to the number of observations being used, the periodograms can appear somewhat cluttered if no transformations are applied. To increase readability, I smoothed the periodograms by applying a moving-average filter to the spectral density values.
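A moving-average smoother of this kind can be sketched with `np.convolve`; the window size and data here are purely illustrative (the report's version is `res.norm_smooth_periodogram`):

```python
import numpy as np

def smooth_psd(psd, window_size=8):
    """Moving-average smoothing of periodogram ordinates (same length out)."""
    kernel = np.ones(window_size) / window_size
    return np.convolve(psd, kernel, mode="same")

rng = np.random.default_rng(2)
raw = 1.0 + 0.5 * rng.standard_normal(512)   # noisy, roughly flat spectrum
smoothed = smooth_psd(raw, window_size=16)
# Averaging neighboring ordinates shrinks their variance, cleaning up the plot.
```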
psd, freq = res.periodogram_res(res_signals, sr=44100)
smooth_psd, smooth_freq = res.norm_smooth_periodogram(psd, freq, window_size=1024)
Next, I analyze the correlation between the residuals and the target source; high correlation in specific regions might indicate systematic errors in the masking model.
res.plot_res_acf(est_signals, true_signals)
Finally, to see whether each source's residuals correlate with frequencies where other sources are active, we used a cross spectrum. A cross spectrum is an idea we learned in class that combines two previously established ideas: the cross-correlation function and the periodogram. Essentially, it shows us which frequencies two different time series share or do not share. Mathematically, a cross spectrum can be represented as:
$$f_{x,y}(\omega_j) = \sum_{h=-\infty}^{\infty}{\gamma_{x,y}(h) \cdot e^{-2\pi i\omega_j h}}$$Where $\gamma_{x,y}(h)$ is the cross-covariance function and $\omega_j$ are the frequencies.
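The cross spectrum can be estimated crudely with a cross-periodogram built from two FFTs. In this hypothetical example, two series share a 50 Hz component while only one contains a 120 Hz tone, and the magnitude of the cross spectrum peaks at the shared frequency:

```python
import numpy as np

rng = np.random.default_rng(3)
sr, n = 1000, 4000
t = np.arange(n) / sr

# Two series that share a 50 Hz component; y also has its own 120 Hz tone.
shared = np.sin(2 * np.pi * 50 * t)
x = shared + 0.1 * rng.standard_normal(n)
y = shared + np.sin(2 * np.pi * 120 * t) + 0.1 * rng.standard_normal(n)

# Cross-periodogram: (1/n) * X(w) * conj(Y(w)); large only where both series
# have energy, so the unshared 120 Hz tone contributes almost nothing.
X = np.fft.rfft(x)
Y = np.fft.rfft(y)
Pxy = X * np.conj(Y) / n
freqs = np.fft.rfftfreq(n, d=1 / sr)

peak = freqs[np.argmax(np.abs(Pxy))]
```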
for i in range(4):
res.source_csd(res_signals, smooth_psd, smooth_freq, i)
Looking at the ACF and PACF, this tells us the residuals are not stationary, so the residuals are not random noise. Therefore, there is definitely some information not being fully captured by the masks. And that information is clearly sinusoidal, which isn't surprising considering the data we are dealing with is music. I think, in general, we can assume there will always be some information missing from the model outputs. This shouldn't be a warning sign on its own, as some of the frequencies present or absent in the output may be ones we are incapable of even hearing.
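The oscillating-ACF signature of a sinusoidal residual can be reproduced on toy data with a hypothetical sample-ACF helper (the report's own plots come from `res.plot_res_acf`):

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation rho_hat(h) for h = 0..max_lag."""
    x = x - x.mean()
    denom = np.sum(x ** 2)
    return np.array([np.sum(x[h:] * x[:len(x) - h]) / denom
                     for h in range(max_lag + 1)])

t = np.arange(500)
sinusoid = np.sin(2 * np.pi * t / 50)   # period-50 oscillation

acf = sample_acf(sinusoid, 100)
# The ACF oscillates with the same period instead of dying out, which is
# exactly the non-white pattern seen in the residual ACF plots.
```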
res.display_csd(est_signals, true_signals)
print("Normal Signal")
ipd.display(ipd.Audio((stems[0]), rate=sr))
added_signal = np.sin(2*np.pi*19*(np.arange(len(stems[0]))/sr))  # 19 Hz tone, just below the audible range
mute_signal = stems[0] + added_signal
print("Normal Signal w Added Signal")
ipd.display(ipd.Audio((mute_signal), rate=sr))
Normal Signal
Normal Signal w Added Signal
Next, I check whether there are temporal patterns in the residuals that could indicate time-based issues in the separation, such as lag or phase errors.
phase_spec, phase_est = res.get_phase(spec_data, s_hat)
phase_error = res.calculate_phase_error(phase_spec[1:], phase_est)
res.plot_res_over_phase(phase_error, residuals)
# # Enable the 'ipympl' backend
# %matplotlib widget
# res.display_interactive_plot_res(residuals, phase_error)
There are a few ways to look at residuals mathematically that we learned in class and that are frequently used in signal processing.
The MSE, based on the L2 norm, is maybe one of the most commonly used error/loss functions in statistics. It's a good idea to examine this statistic to get a better understanding of the residuals. For this task the MSE takes the form:
$$\text{MSE or }L2_{time} = \frac{1}{KT}\sum_{t,k}|\hat{y}_{t,k} - y_{t,k}|^2$$$$L1_{time} = \frac{1}{KT}\sum_{t,k}|\hat{y}_{t,k} - y_{t,k}|$$Many modules have built-in functions to calculate MSE or L1 norms.
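A minimal numpy sketch of these two norms (the helper name only mirrors the report's `res.calc_norms`; its actual implementation may differ):

```python
import numpy as np

def calc_norms_sketch(y_hat, y):
    """MSE (mean squared error) and mean L1 error over all entries."""
    resid = y_hat - y
    mse = np.mean(np.abs(resid) ** 2)
    l1 = np.mean(np.abs(resid))
    return mse, l1

# Toy 2x2 "signal" with a constant offset of 0.5 as the residual.
y = np.array([[1.0, 2.0], [3.0, 4.0]])
y_hat = y + 0.5
mse, l1 = calc_norms_sketch(y_hat, y)   # mse = 0.25, l1 = 0.5
```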
mse, me = res.calc_norms(residuals)
Drums: MSE=0.5322632539636895, l1=0.0059182979713163905
Bass: MSE=1.4550852278186694, l1=0.021212134586395287
Other: MSE=21.749803741009515, l1=0.06929522844168201
Vocals: MSE=5.187617823185366, l1=0.03445190099871528
The next statistics are used in music source separation but also in many other applications of signal processing. Before presenting them, it's probably best to introduce a few types of error in music source separation.
Noise: Random values added on top of a signal; when the values are independently generated at a regular interval with constant mean and variance, it is called white noise. Noise will typically be a feature of any real-world signal and represents random variation in the process over time.
You can simulate noise in an audio signal very easily by adding a term to the signal data itself.
print("Normal Signal")
ipd.display(ipd.Audio((stems[0]), rate=sr))
print("Signal w/ added white noise")
white_noise = np.random.normal(0,0.05,size=len(stems[0]))
ipd.display(ipd.Audio((stems[0] + white_noise), rate=sr))
Normal Signal
Signal w/ added white noise
Interference: In music source separation, interference refers to the amount of other, non-desired sources that appear in an estimated source. For example, if you heard the drums in the estimation of the bass source, that would be interference.
You can simulate this by simply adding a bit of one source's audio to another.
print("Bass Signal")
ipd.display(ipd.Audio((stems[2]), rate=sr))
interference = stems[2] + (stems[1]*white_noise) # Adding the drums to the bass track at a random proportion of white noise.
print("Signal w/ interference")
ipd.display(ipd.Audio(interference, rate=sr))
Bass Signal
Signal w/ interference
Artifacts: Unwanted alterations in an audio signal; these typically arise from compression, transmission errors, or sloppy processing.
One type of artifact you can simulate is clipping, where the amplitude of the audio exceeds the maximum value that the system can record.
print("Normal Signal")
ipd.display(ipd.Audio((stems[0]), rate=sr))
clipped = np.clip(stems[0], -0.25, 0.25)
print("Clipped Signal")
ipd.display(ipd.Audio(clipped, rate=sr))
Normal Signal
Clipped Signal
We then use these error terms to come up with this interpretation of our estimated sources.
$$\hat{s_j} = s_{\text{target}} + e_{\text{interf}} + e_{\text{noise}} + e_{\text{artif}}$$Where $\hat{s_j}$ is our estimated source for source $j$, $s_{\text{target}}$ is our true source audio w/ allowed distortion*, $e_{\text{interf}}, e_{\text{noise}}, e_{\text{artif}}$ are the error terms for interference, noise and artifacts respectively.
*By allowed distortion, I essentially mean any difference in the estimated source that is not covered by these three error terms.
To calculate these terms we use these formulas:
Let $\Pi\{y_1,...,y_k\}$ denote an orthogonal projector onto the subspace spanned by $y_1,...,y_k$.
$$P_{s_j} = \Pi \{ s_j \}$$$$ P_{s} = \Pi \{ s_1, ..., s_n \} $$$$ P_{s,n} = \Pi \{ s_1, ..., s_n, n_1, ..., n_m \}, \text{where } n_i\text{ is the noise from the }i\text{th microphone} $$And remember, projection onto the span of a single vector is: $$\Pi \{ s_j \}y_k = \frac{ \langle y_k, s_j \rangle }{\langle s_j,s_j \rangle}\, s_j $$
Then use these for:
$$s_{\text{target}} = P_{s_j}\hat s_j$$$$e_{\text{interf}} = P_{s}\hat s_j - P_{s_j}\hat s_j$$$$e_{\text{noise}} = P_{s,n}\hat s_j - P_{s}\hat s_j$$$$e_{\text{artif}} = \hat s_j - P_{s,n}\hat s_j$$$$\text{So: } \hat s_j = (P_{s_j}\hat s_j) + (P_{s}\hat s_j - P_{s_j}\hat s_j) + (P_{s,n}\hat s_j - P_{s}\hat s_j) + (\hat s_j - P_{s,n}\hat s_j)$$Finally the metrics are calculated with:
$$SDR = 10log_{10}\frac{||s_{\text{target}}||^2}{||e_{\text{interf}} + e_{\text{noise}} + e_{\text{artif}}||^2}$$$$SIR = 10log_{10}\frac{||s_{\text{target}}||^2}{||e_{\text{interf}}||^2}$$$$SNR = 10log_{10}\frac{||s_{\text{target}} + e_{\text{interf}}||^2}{||e_{\text{noise}}||^2}$$$$SAR = 10log_{10}\frac{||s_{\text{target}} + e_{\text{noise}} + e_{\text{interf}}||^2}{||e_{\text{artif}}||^2}$$Note: This can get somewhat confusing, as SNR is also a commonly used metric in signal processing; however, this version differs from the usual one. The usual SNR is calculated with:
$$SNR = 10log_{10}\frac{||s_{j}||^2}{||\hat s_{j} - s_{j}||^2}$$
import pandas as pd
references = np.array(stems[1:]).T
drums_eval, bass_eval, other_eval, vocals_eval = [res.calc_evals(est_signals[i], references, i) for i in range(4)]
metrics = pd.DataFrame(data = [drums_eval, bass_eval, other_eval, vocals_eval],
columns=["SISDR", "SISIR", "SISAR", 'SDSDR', "SNR", "SRR"],
index = ["Drums", "Bass", "Other", "Vocals"]
)
# Calculating the mean of each column
mean_row = metrics.mean()
# Converting the mean series to a DataFrame
mean_df = pd.DataFrame(mean_row).T
# Assigning a label for the index of the mean row
mean_df.index = ['Total Score']
# Appending the mean row to the original DataFrame
metrics = pd.concat([metrics, mean_df])
metrics
| | SISDR | SISIR | SISAR | SDSDR | SNR | SRR |
|---|---|---|---|---|---|---|
| Drums | 6.059788 | 12.621812 | 7.142728 | 6.823634 | 6.970171 | 14.714377 |
| Bass | 6.810572 | 12.789123 | 8.074067 | 7.477198 | 7.624050 | 14.694119 |
| Other | 9.564220 | 13.848223 | 11.590896 | 9.928236 | 10.019616 | 19.098387 |
| Vocals | 8.368302 | 14.999946 | 9.431727 | 8.844583 | 8.942912 | 18.426378 |
| Total Score | 7.700721 | 13.564776 | 9.059855 | 8.268413 | 8.389187 | 16.733315 |
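To tie the projection formulas to the metrics, here is a toy two-source, noise-free sketch; all names and signals are hypothetical (the report's actual numbers come from `res.calc_evals`), and with no noise subspace the SNR term is skipped:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2048
s1 = rng.standard_normal(n)        # target source
s2 = rng.standard_normal(n)        # interfering source

def proj(basis, y):
    """Orthogonal projection of y onto the span of the rows of `basis`."""
    B = np.atleast_2d(basis)
    coeffs, *_ = np.linalg.lstsq(B.T, y, rcond=None)
    return B.T @ coeffs

# A toy estimate: mostly s1, some s2, plus a component outside span{s1, s2}.
s_hat = s1 + 0.2 * s2 + 0.05 * rng.standard_normal(n)

both = proj(np.vstack([s1, s2]), s_hat)
s_target = proj(s1, s_hat)
e_interf = both - s_target
e_artif = s_hat - both             # everything outside span{s1, s2}

# The decomposition is exact by construction: s_hat = s_target + e_interf + e_artif
recon = s_target + e_interf + e_artif

sdr = 10 * np.log10(np.sum(s_target**2) / np.sum((e_interf + e_artif)**2))
sir = 10 * np.log10(np.sum(s_target**2) / np.sum(e_interf**2))
sar = 10 * np.log10(np.sum((s_target + e_interf)**2) / np.sum(e_artif**2))
```

Since `e_artif` is orthogonal to the source subspace while `e_interf` lies inside it, the SDR denominator can only be larger than the SIR denominator, so SIR ≥ SDR always holds.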
For reference, a very good SDR score for a music source separator is about 6. However, the most state-of-the-art MSS systems have reported SDR scores of up to 9.20!